class: center, middle, inverse, title-slide # Tidying data ##
Data Science & Statistics ### Gavin McNicol ### 2021-09-27 --- class: middle # .hand[We...] .huge[.green[have]] .hand[data organised in an unideal way for our analysis] .huge[.pink[want]] .hand[to reorganise the data to carry on with our analysis] --- ## Data: Sales <br> .pull-left[ ### .green[We have...] ``` ## # A tibble: 2 × 4 ## customer_id item_1 item_2 item_3 ## <dbl> <chr> <chr> <chr> ## 1 1 bread milk banana ## 2 2 milk toilet paper <NA> ``` ] -- .pull-right[ ### .pink[We want...] ``` ## # A tibble: 6 × 3 ## customer_id item_no item ## <dbl> <chr> <chr> ## 1 1 item_1 bread ## 2 1 item_2 milk ## 3 1 item_3 banana ## 4 2 item_1 milk ## 5 2 item_2 toilet paper ## 6 2 item_3 <NA> ``` ] --- ## A grammar of data tidying .pull-left[ <img src="data:image/png;base64,#img/tidyr-part-of-tidyverse.png" width="60%" /> ] .pull-right[ The goal of tidyr is to help you tidy your data via - pivoting for going between wide and long data - splitting and combining character columns - nesting and unnesting columns - clarifying how `NA`s should be treated ] --- class: middle # Pivoting data --- ## Not this... <img src="data:image/png;base64,#img/pivot.gif" width="70%" /> --- ## but this! .center[ <img src="data:image/png;base64,#img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3" /> ] --- ## Wider vs. longer .pull-left[ ### .green[wider] more columns ``` ## # A tibble: 2 × 4 ## customer_id item_1 item_2 item_3 ## <dbl> <chr> <chr> <chr> ## 1 1 bread milk banana ## 2 2 milk toilet paper <NA> ``` ] -- .pull-right[ ### .pink[longer] more rows ``` ## # A tibble: 6 × 3 ## customer_id item_no item ## <dbl> <chr> <chr> ## 1 1 item_1 bread ## 2 1 item_2 milk ## 3 1 item_3 banana ## 4 2 item_1 milk ## 5 2 item_2 toilet paper ## 6 2 item_3 <NA> ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) ] .pull-right[ ```r pivot_longer( * data, cols, names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format ] .pull-right[ ```r pivot_longer( data, * cols, names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format - `names_to`: name of the column where column names of pivoted variables go (character string) ] .pull-right[ ```r pivot_longer( data, cols, * names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format - `names_to`: name of the column where column names of pivoted variables go (character string) - `values_to`: name of the column where data in pivoted variables go (character string) ] .pull-right[ ```r pivot_longer( data, cols, names_to = "name", * values_to = "value" ) ``` ] --- ## Customers `\(\rightarrow\)` purchases ```r purchases <- customers %>% * pivot_longer( * cols = item_1:item_3, # variables item_1 to item_3 * names_to = "item_no", # column names -> new column called item_no * values_to = "item" # values in columns -> new column called item * ) purchases ``` ``` ## # A tibble: 6 × 3 ## customer_id item_no item ## <dbl> <chr> <chr> ## 1 1 item_1 bread ## 2 1 item_2 milk ## 3 1 item_3 banana ## 4 2 item_1 milk ## 5 2 item_2 toilet paper ## 6 2 item_3 <NA> ``` --- ## Why pivot? Most likely, because the next step of your analysis needs it -- .pull-left[ ```r prices ``` ``` ## # A tibble: 5 × 2 ## item price ## <chr> <dbl> ## 1 avocado 0.5 ## 2 banana 0.15 ## 3 bread 1 ## 4 milk 0.8 ## 5 toilet paper 3 ``` ] .pull-right[ ```r purchases %>% * left_join(prices) ``` ``` ## Joining, by = "item" ``` ``` ## # A tibble: 6 × 4 ## customer_id item_no item price ## <dbl> <chr> <chr> <dbl> ## 1 1 item_1 bread 1 ## 2 1 item_2 milk 0.8 ## 3 1 item_3 banana 0.15 ## 4 2 item_1 milk 0.8 ## 5 2 item_2 toilet paper 3 ## 6 2 item_3 <NA> NA ``` ] --- ## Purchases `\(\rightarrow\)` customers .pull-left-narrow[ - `data` (as usual) - `names_from`: which column in the long format contains the what should be column names in the wide format - `values_from`: which column in the long format contains the what should be values in the new columns in the wide format ] .pull-right-wide[ ```r purchases %>% * pivot_wider( * names_from = item_no, * values_from = item * ) ``` ``` ## # A tibble: 2 × 4 ## customer_id item_1 item_2 item_3 ## <dbl> <chr> <chr> <chr> ## 1 1 bread milk banana ## 2 2 milk toilet paper <NA> ``` ] --- class: top # Case study: Deforestation rates in Brazil <img src="data:image/png;base64,#img/deforestation.jpeg" width="70%" /> --- <img src="data:image/png;base64,#img/deforestation-trends-brazil.png" width="90%" /> .footnote[ Source: [Our World in Data](https://ourworldindata.org/forests-and-deforestation/) ] --- ## Data ```r brazil_loss ``` ``` ## # A tibble: 13 × 14 ## entity code year commercial_crops flooding_due_to… natural_disturb… pasture ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Brazil BRA 2001 280000 0 0 1520000 ## 2 Brazil BRA 2002 415000 79000 35000 2568000 ## 3 Brazil BRA 2003 550000 0 35000 2761000 ## 4 Brazil BRA 2004 747000 26000 22000 2564000 ## 5 Brazil BRA 2005 328000 17000 26000 2665000 ## 6 Brazil BRA 2006 188000 17000 26000 1861000 ## 7 Brazil BRA 2007 79000 9000 22000 1577000 ## 8 Brazil BRA 2008 52000 0 17000 1345000 ## 9 Brazil BRA 2009 57000 9000 31000 847000 ## 10 Brazil BRA 2010 100000 0 44000 616000 ## # … with 3 more rows, and 7 more variables: selective_logging <dbl>, ## # fire <dbl>, mining <dbl>, other_infrastructure <dbl>, roads <dbl>, ## # tree_plantations_including_palm <dbl>, small_scale_clearing <dbl> ``` --- ## Goal <img src="data:image/png;base64,#u2-d09-tidying_files/figure-html/unnamed-chunk-20-1.png" width="50%" /> --- ## Pivot ```r brazil_loss_longer <- brazil_loss %>% * pivot_longer( * 4:14, * names_to = "fate", * values_to = "area" * ) brazil_loss_longer ``` ``` ## # A tibble: 143 × 5 ## entity code year fate area ## <chr> <chr> <dbl> <chr> <dbl> ## 1 Brazil BRA 2001 commercial_crops 280000 ## 2 Brazil BRA 2001 flooding_due_to_dams 0 ## 3 Brazil BRA 2001 natural_disturbances 0 ## 4 Brazil BRA 2001 pasture 1520000 ## 5 Brazil BRA 2001 selective_logging 96000 ## 6 Brazil BRA 2001 fire 26000 ## 7 Brazil BRA 2001 mining 9000 ## 8 Brazil BRA 2001 other_infrastructure 9000 ## 9 Brazil BRA 2001 roads 13000 ## 10 Brazil BRA 2001 tree_plantations_including_palm 44000 ## # … with 133 more rows ``` --- ## Plot ```r *ggplot(brazil_loss_longer, * aes(x = year, y = area, color = fate)) + * geom_line() ``` <!-- --> --- .panelset[ .panel[.panel-name[Code] ```r ggplot(brazil_loss_longer, aes(year, area, color = fate)) + geom_line() + * labs( * x = "Year", y = "Area Deforested", * color = NULL, * title = "What are the causes of deforestation?", * subtitle = "Loss of Brazilian forest due to specific types", * caption = "Source: Our World in Data" * ) ``` ] .panel[.panel-name[Plot] <img src="data:image/png;base64,#u2-d09-tidying_files/figure-html/unnamed-chunk-23-1.png" width="30%" /> ] ] --- .panelset[ .panel[.panel-name[Code] ```r ggplot(brazil_loss_longer, aes(year, area, color = fate)) + geom_line() + labs( x = "Year", y = "Area Deforested", color = NULL, title = "What are the causes of deforestation?", subtitle = "Loss of Brazilian forest due to specific types", caption = "Source: Our World in Data" ) + * theme_minimal() ``` ] .panel[.panel-name[Plot] <img src="data:image/png;base64,#u2-d09-tidying_files/figure-html/unnamed-chunk-24-1.png" width="30%" /> ] ] --- .center[ .large[ This class content was built from the Data Science in a Box source materials. https://datasciencebox.org/index.html ] ]